Improve SLM Health Indicator to cover missing snapshot #121370

samxbr · 2025-01-31T08:12:04Z

Currently, the SLM health indicator in the health report turns YELLOW when snapshots fail a number of times. However, it stays GREEN if no snapshot has completed (neither success nor failure) for a long time. This change adds a new optional setting, unhealthy_if_no_snapshot_within, to the SLM policy that sets a time threshold. If the SLM policy has not had a successful snapshot for longer than the threshold, the SLM health indicator turns YELLOW.
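
For readers skimming the thread, the rule described above can be sketched roughly as follows. This is a simplified illustration, not the code in SlmHealthIndicatorService: the class, enum, and method names are hypothetical, plain java.time types stand in for the Elasticsearch ones, and returning GREEN when there is no prior success is an assumption that mirrors the first-snapshot case being left for later.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical, simplified stand-in for the missing-snapshot check; not the actual service code. */
final class MissingSnapshotCheck {

    enum Status { GREEN, YELLOW }

    /**
     * @param lastSuccessStart start time of the last successful snapshot, or null if there is none
     * @param threshold        value of unhealthy_if_no_snapshot_within, or null if the setting is absent
     * @param now              the current time
     */
    static Status evaluate(Instant lastSuccessStart, Duration threshold, Instant now) {
        // The setting is optional: without it, this check does not apply.
        if (threshold == null) {
            return Status.GREEN;
        }
        // Assumption: with no prior success there is nothing to compare against yet.
        if (lastSuccessStart == null) {
            return Status.GREEN;
        }
        Duration sinceLastSuccess = Duration.between(lastSuccessStart, now);
        // YELLOW only when the last success is older than the threshold.
        return sinceLastSuccess.compareTo(threshold) > 0 ? Status.YELLOW : Status.GREEN;
    }
}
```

With a 24-hour threshold, a last success 30 hours ago would report YELLOW in this sketch, while one 2 hours ago would report GREEN.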

x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SlmHealthIndicatorService.java

samxbr · 2025-02-04T15:30:17Z

x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SlmHealthIndicatorService.java

+        if (policy.getLastSuccess() != null) {
+            // prefer snapshotStartTimestamp over snapshotFinishTimestamp in case of a very long-running snapshot
+            // that started a long time ago
+            SnapshotInvocationRecord lastSuccess = policy.getLastSuccess();
+            return lastSuccess.getSnapshotStartTimestamp() != null
+                ? lastSuccess.getSnapshotStartTimestamp()
+                : lastSuccess.getSnapshotFinishTimestamp();
+        }
+        // SX TODO: handle first snapshot (i.e. no prior success of failure), maybe record the policy first trigger timestamp
+


This doesn't yet handle the first snapshot (i.e. no prior successful snapshot). To handle that, I am thinking of recording the first SLM trigger time in the SLM metadata (SnapshotLifecyclePolicyMetadata) so we can check against it here. I can do that in a follow-up PR if the idea makes sense.

I think it'd be good to handle that particular case, so +1 to investigate it after this PR
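
A rough sketch of how that follow-up idea might look, assuming a recorded first trigger time is available as a fallback. Everything here is hypothetical: the class and method names are invented for illustration, timestamps are plain epoch milliseconds, and no such field exists in SnapshotLifecyclePolicyMetadata as of this PR.

```java
/** Hypothetical sketch of the proposed fallback; names and field are illustrative only. */
final class ReferenceTimestamp {

    /**
     * Picks the timestamp the threshold would be measured from.
     * All arguments are epoch milliseconds; null means "not recorded".
     * firstTriggerMillis stands in for the proposed (not yet existing) metadata field.
     */
    static Long resolve(Long lastSuccessStartMillis, Long lastSuccessFinishMillis, Long firstTriggerMillis) {
        // Prefer the start time, so a very long-running snapshot still counts from when it began.
        if (lastSuccessStartMillis != null) {
            return lastSuccessStartMillis;
        }
        if (lastSuccessFinishMillis != null) {
            return lastSuccessFinishMillis;
        }
        // No prior success: fall back to when the policy first triggered, if known.
        return firstTriggerMillis;
    }
}
```

A null result would mean the policy has neither succeeded nor triggered yet, so there is nothing to compare the threshold against.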

elasticsearchmachine · 2025-02-04T15:33:45Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine · 2025-02-04T15:38:33Z

Hi @samxbr, I've created a changelog YAML for you.

samxbr · 2025-02-04T15:44:21Z

Due to the current doc freeze, the doc change for this PR has not been added yet. It will likely be in a separate PR after the doc freeze.

dakrone

I left some comments, but this generally looks good! What do you think of the naming and validation questions?

test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/slm/SnapshotLifecyclePolicy.java

dakrone · 2025-02-04T23:47:25Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/slm/SnapshotLifecyclePolicy.java

+                    "invalid missingSnapshotUnhealthyThreshold ["
+                        + missingSnapshotUnhealthyThreshold.getStringRep()
+                        + "]: "
+                        + "time is too short, expecting at least more than the interval between snapshots ["


I'm on the fence about this validation. Should we actually prevent someone from making the threshold too small, given that they can always manually execute the SLM policy to take a snapshot? I'm not sure it's worth it. What do you think?

I agree that this validation is not strictly necessary; the worst case without it is that the indicator will always be YELLOW if the threshold is set too small.

I am leaning slightly towards keeping it. My philosophy here is to provide quick feedback to the user rather than letting them find out through the health indicator. I can't think of a use case where a user would set this threshold smaller than the snapshot interval. I also can't think of a drawback to keeping this validation: it's OK if a user occasionally executes SLM manually, since that just resets the threshold time and doesn't defeat the purpose of this setting. We can definitely remove this validation if it has a potential negative impact I didn't think of.

I'm leaning towards keeping the validation as well. I also don't really see a reason why we would allow users to do this, but I do see value in avoiding support tickets where users do this. Plus, we get the unhealthyIfNoSnapshotWithin > 0 validation for free.
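
To make the validation being discussed concrete, here is a minimal sketch under stated assumptions: the class and method names are hypothetical, java.time.Duration stands in for the Elasticsearch time value type, the interval between snapshots is assumed to be already computed from the schedule and positive, and the message wording is only modelled on the snippet quoted above.

```java
import java.time.Duration;

/** Hypothetical sketch of the threshold validation; not the code in SnapshotLifecyclePolicy. */
final class ThresholdValidation {

    /**
     * @param threshold               the configured threshold, or null if the optional setting is absent
     * @param intervalBetweenSnapshots time between two scheduled snapshots (assumed positive)
     * @return an error message, or null when the threshold is acceptable
     */
    static String validate(Duration threshold, Duration intervalBetweenSnapshots) {
        if (threshold == null) {
            return null; // optional setting, nothing to validate
        }
        // Requiring threshold > interval also rejects zero and negative thresholds,
        // because the interval itself is positive.
        if (threshold.compareTo(intervalBetweenSnapshots) <= 0) {
            return "invalid threshold [" + threshold + "]: time is too short, expecting at least more than "
                + "the interval between snapshots [" + intervalBetweenSnapshots + "]";
        }
        return null;
    }
}
```

The single comparison is what gives the "> 0 for free" property mentioned above: any threshold that is zero or negative is necessarily not greater than a positive interval.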

x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SlmHealthIndicatorService.java

dakrone · 2025-02-04T23:50:50Z

x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SlmHealthIndicatorService.java

+        if (policy.getLastSuccess() != null) {
+            // prefer snapshotStartTimestamp over snapshotFinishTimestamp in case of a very long-running snapshot
+            // that started a long time ago
+            SnapshotInvocationRecord lastSuccess = policy.getLastSuccess();
+            return lastSuccess.getSnapshotStartTimestamp() != null
+                ? lastSuccess.getSnapshotStartTimestamp()
+                : lastSuccess.getSnapshotFinishTimestamp();
+        }
+        // SX TODO: handle first snapshot (i.e. no prior success of failure), maybe record the policy first trigger timestamp
+


I think it'd be good to handle that particular case, so +1 to investigate it after this PR

dakrone · 2025-02-04T23:51:14Z

x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SlmHealthIndicatorService.java

+                ? lastSuccess.getSnapshotStartTimestamp()
+                : lastSuccess.getSnapshotFinishTimestamp();
+        }
+        // SX TODO: handle first snapshot (i.e. no prior success of failure), maybe record the policy first trigger timestamp


I think you can change this to just a regular TODO instead of one specific to you :)

Don't forget about this one :)


samxbr · 2025-02-10T16:00:31Z

Hi @nielsbauman, I added you as a backup reviewer in case @leehinman doesn't have time to review this week 😄

nielsbauman

I've had a look at most of the code; I only have the two large test files left to check, which I'll do later today. Looking great so far!

server/src/main/java/org/elasticsearch/TransportVersions.java

nielsbauman · 2025-02-11T00:30:24Z

test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java

-     * Runs the code block for the provided interval, waiting for no assertions to trip.
+     * Runs the code block for the provided interval, waiting for no assertions to trip. Retries on AssertionError
+     * with exponential backoff until provided time runs out


Is this change really necessary? I don't feel this change is super valuable (it's not wrong, just not absolutely necessary), and since it's the only change in this file, I'd prefer to revert it to reduce the scope of the PR a bit.

I think the added comment makes it clearer that it will retry until the timeout. I wasn't sure from the original comment whether it would retry, and had to read the code. Maybe it's just me though.

Hmm yeah that's why I said you're not wrong. I'm just not a huge fan of PRs changing unrelated files. I prefer PRs to have a more defined scope to make them easier to read and for commit history to be a bit more understandable.

I get what you mean, and I would usually agree with you for changes that are less trivial than this. It's pretty common for people to find opportunities for tiny improvements while working on unrelated code. I think it adds unnecessary friction if we split every tiny improvement into a separate PR (like a PR just to change this one comment line). To me, it's more important to continuously make tiny improvements to the code base than to keep every PR strictly within its scope.

That being said, I agree with you 100% on keeping PRs in scope. This one just feels too trivial to be "out of scope".
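
For context on what the expanded Javadoc is describing, here is a minimal sketch of "retry on AssertionError with exponential backoff until the provided time runs out". It is not the ESTestCase implementation: the class and method names are hypothetical, the starting delay of 1 ms and the doubling factor are assumptions, and a plain Runnable is used instead of whatever functional interface the real helper takes.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a retry-until-timeout helper; not the ESTestCase code. */
final class RetrySketch {

    static void retryUntilPasses(Runnable assertion, long maxWait, TimeUnit unit) {
        final long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1; // assumed initial delay
        while (true) {
            try {
                assertion.run();
                return; // no assertion tripped, done
            } catch (AssertionError e) {
                long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMillis <= 0) {
                    throw e; // time ran out: surface the last failure
                }
                try {
                    // Never sleep past the deadline.
                    Thread.sleep(Math.min(sleepMillis, remainingMillis));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
                sleepMillis *= 2; // exponential backoff
            }
        }
    }
}
```

The point of contention in the thread is only the wording of the comment; the sketch shows why "retries until the timeout" is not obvious from "waiting for no assertions to trip" alone.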

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/slm/SnapshotLifecyclePolicy.java

nielsbauman · 2025-02-11T00:46:32Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/slm/SnapshotLifecyclePolicy.java

+                    "invalid missingSnapshotUnhealthyThreshold ["
+                        + missingSnapshotUnhealthyThreshold.getStringRep()
+                        + "]: "
+                        + "time is too short, expecting at least more than the interval between snapshots ["


I'm leaning towards keeping the validation as well. I also don't really see a reason why we would allow users to do this, but I do see value in avoiding support tickets where users do this. Plus, we get the unhealthyIfNoSnapshotWithin > 0 validation for free.

x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SlmHealthIndicatorService.java

x-pack/plugin/slm/src/test/java/org/elasticsearch/xpack/slm/SnapshotLifecyclePolicyTests.java

...ster-restart/src/javaRestTest/java/org/elasticsearch/xpack/restart/FullClusterRestartIT.java

nielsbauman

Sorry, I hadn't gotten around to checking the test classes yet. I've reviewed everything now, so we're almost good to go from my end :)

...slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMHealthBlockedSnapshotIT.java

nielsbauman · 2025-02-13T04:48:56Z

test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java

-     * Runs the code block for the provided interval, waiting for no assertions to trip.
+     * Runs the code block for the provided interval, waiting for no assertions to trip. Retries on AssertionError
+     * with exponential backoff until provided time runs out


Hmm yeah that's why I said you're not wrong. I'm just not a huge fan of PRs changing unrelated files. I prefer PRs to have a more defined scope to make them easier to read and for commit history to be a bit more understandable.

x-pack/plugin/slm/src/test/java/org/elasticsearch/xpack/slm/SlmHealthIndicatorServiceTests.java

nielsbauman

LGTM, thanks for iterating on this, @samxbr!

samxbr added 3 commits January 31, 2025 13:37

SLM WIP

2189670

add constructor

64b10ee

Merge branch 'main' into feature/slm-health

dc371ff

samxbr marked this pull request as draft January 31, 2025 08:12

elasticsearchmachine added the v9.1.0 label Jan 31, 2025

samxbr and others added 4 commits January 31, 2025 16:13

remove last_successful_snapshot_timestamp

a80fb70

[CI] Auto commit changes from spotless

1f35dd0

add more

7bc8e1c

Merge branch 'main' into feature/slm-health

acf81e8

samxbr changed the title ~~Improve SLM Health Indicator to cover for missing successful snapshot for a long time~~ Improve SLM Health Indicator to cover missing snapshot Feb 3, 2025

elasticsearchmachine and others added 7 commits February 3, 2025 12:07

[CI] Auto commit changes from spotless

5cbde97

Merge branch 'main' into feature/slm-health

b6498b9

Add integ tests

366bf48

[CI] Auto commit changes from spotless

791b591

rename

6cf8da2

rename

94f6356

[CI] Auto commit changes from spotless

c18641f

samxbr commented Feb 4, 2025

samxbr commented Feb 4, 2025

samxbr added the :Data Management/ILM+SLM Index and Snapshot lifecycle management label Feb 4, 2025

samxbr requested a review from dakrone February 4, 2025 15:32

samxbr marked this pull request as ready for review February 4, 2025 15:33

samxbr requested a review from a team as a code owner February 4, 2025 15:33

elasticsearchmachine added the Team:Data Management Meta label for data/management team label Feb 4, 2025

samxbr added the >enhancement label Feb 4, 2025

Update docs/changelog/121370.yaml

fc8694b

Merge branch 'main' into feature/slm-health

17471af

dakrone reviewed Feb 4, 2025

samxbr and others added 7 commits February 5, 2025 11:41

Update test/framework/src/main/java/org/elasticsearch/test/ESTestCase…

22623c5

….java Co-authored-by: Lee Hinman <[email protected]>

Merge branch 'main' into feature/slm-health

bafd842

Merge branch 'main' into feature/slm-health

691a4dc

rename to unhealthyIfNoSnapshotWithin

96bdf38

[CI] Auto commit changes from spotless

fd96eea

Update help url

467e930

Merge branch 'main' into feature/slm-health

331bd31

samxbr requested review from dakrone and nielsbauman February 10, 2025 15:46

nielsbauman requested changes Feb 11, 2025

samxbr and others added 3 commits February 12, 2025 12:55

Merge branch 'main' into feature/slm-health

22a65fd

minor

de772a0

[CI] Auto commit changes from spotless

c86f2da

samxbr mentioned this pull request Feb 12, 2025

Add health indicator impact to HealthPeriodicLogger #122390

Merged

nielsbauman reviewed Feb 13, 2025

samxbr added 2 commits February 13, 2025 15:34

change from comments

d844589

Merge branch 'main' into feature/slm-health

14fa3ce

samxbr requested a review from nielsbauman February 13, 2025 08:45

nielsbauman approved these changes Feb 13, 2025

samxbr merged commit 5d48ded into elastic:main Feb 14, 2025
17 checks passed
